---
title: "Custom metrics"
description: "Build specialized metrics, integrate external models, and reuse them across optimizations."
---

Use custom metrics when the built-in metrics are not enough, e.g. for domain-specific scoring, precise safety checks, or specialized multimodal checks. Start with the core Opik evaluation docs so you know what already exists:

- [Evaluation concepts](/evaluation/concepts) – terminology and lifecycle.
- [Metrics overview](/evaluation/metrics/overview) – default heuristic metrics (ROUGE, BLEU, Hallucination, etc.).
- [LLM-as-a-judge patterns](/evaluation/evaluate_agent_trajectory) – how Opik runs judge models against multi-turn traces.

## Design principles

- **Deterministic** – cache external model calls. Where supported by the model, set temperature to 0 and a seed value to increase the likelihood of repeated runs matching. Note that not all models guarantee deterministic outputs even with these settings.
- **Explainable** – always set `reason` on `ScoreResult` for better dashboards.
- **Composable** – wrap helpers into utility modules so multiple optimizers share them.
- **Layered** – start with single metrics, then combine them via `MultiMetricObjective` when you need trade-offs.
- **Cost-aware** – account for the compute and API spend of evaluation calls; an optimizer can run a metric many times per run.
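
The determinism and cost points above can be sketched with a small memoization layer around an external judge call. Everything below is illustrative: `judge_model_call` is a hypothetical stand-in for whatever real client you use.

```python
import functools

def judge_model_call(text: str) -> float:
    """Hypothetical stand-in for an external judge/model client.

    With a real client, pass temperature=0 and a fixed seed where the
    provider supports them; outputs may still vary between runs.
    """
    # Deterministic placeholder score so this example runs offline.
    return min(len(text) / 100.0, 1.0)

@functools.lru_cache(maxsize=4096)
def cached_judge(text: str) -> float:
    # Identical inputs reuse the memoized score, so repeated optimizer
    # passes over the same output never pay for a second call.
    return judge_model_call(text)
```

With a remote model behind `judge_model_call`, the cache both stabilizes scores within a run and cuts API spend.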

## Example: safety + completeness metric

```python
from opik.evaluation.metrics import AnswerRelevance
from opik.evaluation.metrics.score_result import ScoreResult
from some_safety_model import classify_risk  # hypothetical external safety classifier

# Instantiate once so repeated metric calls reuse the same clients.
relevance_metric = AnswerRelevance()
safety_model = classify_risk.Client()

def safety_and_completeness(item, output):
    relevance = relevance_metric.score(
        input=item["question"],
        output=output,
        context=[item["answer"]],  # reference answer used as grounding context
    )
    safety = safety_model.score(text=output)

    # Binary gate: full credit only when the answer is both relevant and safe.
    value = 1.0 if relevance.value > 0.75 and safety["label"] == "safe" else 0.0
    reason = f"Relevant={relevance.value:.2f}, safety={safety['label']}"

    return ScoreResult(name="safety_completeness", value=value, reason=reason)
```
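
The thresholding in this example is worth isolating into a pure function so it can be unit-tested without any model calls. This refactor is a suggestion, not part of the Opik API:

```python
def gate(relevance_value: float, safety_label: str, threshold: float = 0.75) -> float:
    """Binary gate: full credit only when the output is both relevant and safe."""
    if relevance_value > threshold and safety_label == "safe":
        return 1.0
    return 0.0
```

`safety_and_completeness` would then call `gate(relevance.value, safety["label"])`, and tests can probe edge cases directly, e.g. a relevance score of exactly 0.75 fails the strict `>` check.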

## Metric building blocks

- **Single metrics** – implement one callable per concern (accuracy, tone, cost). Keep them reusable across prompts.
- **Multi-metric objectives** – combine single metrics with weights when you need to balance, e.g., accuracy (0.7) + style (0.3). See [Multi-metric optimization](/agent_optimization/best_practices/multi_metric_optimization) for templates.
- **LLM-as-a-judge** – call out to an evaluation model (OpenAI, Anthropic, etc.) inside the metric. Always include detailed prompts so results stay stable, and understand that reflective optimizers will inherit any noise from these judge calls.
- **Heuristics** – leverage built-ins from `/evaluation/metrics` instead of reinventing classic scores. You can compose heuristics with custom logic as shown above.
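
The multi-metric bullet can be illustrated with a hand-rolled weighted combination. This is a minimal sketch of the idea, not the `MultiMetricObjective` API:

```python
def weighted_objective(scores: dict, weights: dict) -> float:
    # Normalize so the combined value stays on the same 0-1 scale as the
    # individual metrics (assuming each score is already in [0, 1]).
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

# Accuracy weighted 0.7, style 0.3, mirroring the example above.
combined = weighted_objective(
    {"accuracy": 0.9, "style": 0.5},
    {"accuracy": 0.7, "style": 0.3},
)  # 0.9 * 0.7 + 0.5 * 0.3 = 0.78
```

Raising a weight makes the combined score more sensitive to that metric, which is exactly what the tests in the next section should verify.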

## Testing

- Write pytest cases that feed canned dataset items into the metric and assert expected scores.
- Run metrics against a golden dataset on CI to catch regressions.
- For multi-metric objectives, add tests that verify weight changes behave as expected (e.g., higher weight increases sensitivity).
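
The first point can be sketched as a pytest-style case that feeds a canned dataset item into a metric. `exact_match` here is a hypothetical stand-in for your own metric:

```python
def exact_match(item: dict, output: str) -> float:
    # Stand-in metric: full credit only for a verbatim match with the reference.
    return 1.0 if output.strip() == item["answer"].strip() else 0.0

def test_exact_match_on_golden_item():
    item = {"question": "Capital of France?", "answer": "Paris"}
    assert exact_match(item, "Paris") == 1.0
    assert exact_match(item, " Paris \n") == 1.0  # whitespace is normalized
    assert exact_match(item, "Lyon") == 0.0
```

pytest collects any `test_*` function automatically; running cases like this against a golden dataset in CI catches regressions before they reach an optimization run.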

## Related docs

- [Define metrics](/agent_optimization/optimization/define_metrics)
- [Evaluation concepts](/evaluation/concepts)
- [LLM judge workflows](/evaluation/evaluate_agent_trajectory)
- [Metrics overview](/evaluation/metrics/overview)
